Skip to content

Conversation

@Neelabh94
Copy link
Contributor

@Neelabh94 Neelabh94 commented Sep 22, 2025

Rationale for the Change
The original implementation used TensorFlow jobs to test the GKE Hyperdisk volumes. However, the test has been failing continuously due to changes in dependency module versions (e.g., transformers and datasets).
Pinning dependency versions was deemed an inadequate fix, as the test should not rely on a specific version of a non-core component like the TensorFlow example.

Implementation Details
Following an offline discussion, the approach has been changed to replace the problematic TensorFlow jobs with FIO (Flexible I/O Tester) jobs. FIO is a standard benchmarking and stress tool used to measure I/O performance.

  • The change in examples/gke-managed-hyperdisk.yaml replaces the three separate TensorFlow jobs (for Hyperdisk Balanced, Extreme, and Throughput) with a single FIO-based benchmark job.
  • The new FIO job is configured to perform I/O tests against the Hyperdisk volumes, including a randrw test for Balanced, a randwrite test for Extreme, and a write test for Throughput.
  • The deployment now uses a simpler ubuntu:latest image, installs FIO, and executes the benchmarks.
  • The validation logic in tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-gke-managed-hyperdisk.yml has been updated to specifically run the single FIO job, wait for its completion, fetch the pod logs, and print the FIO test results.
  • This approach ensures the test focuses squarely on the I/O performance and mounting of the Hyperdisk volumes rather than being dependent on external machine learning libraries and their changing APIs.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch #
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines #

@Neelabh94 Neelabh94 added the test-enhancement Tests enhancement or coverage improvement label Sep 22, 2025
@Neelabh94 Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch from dd19029 to 11b9cc6 Compare September 24, 2025 05:29
@Neelabh94 Neelabh94 self-assigned this Sep 24, 2025
@Neelabh94 Neelabh94 changed the title fix(gke): add compute.admin role to GKE node pool service account fix(gke-hyperdisk-test): Pin the dependency versions Sep 24, 2025
@Neelabh94 Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch from 5eca840 to b6a98ca Compare September 24, 2025 06:23
@Neelabh94 Neelabh94 marked this pull request as ready for review September 24, 2025 09:47
@Neelabh94 Neelabh94 requested review from a team and samskillman as code owners September 24, 2025 09:47
@Neelabh94 Neelabh94 added the release-chore To not include into release notes label Sep 24, 2025
Copy link
Collaborator

@bytetwin bytetwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point to the RCA on how pinning the dependency version succeeds the test. The test should not be dependent on a specific version of transformer as its not the core part of the blueprint to test.
Moreover cloud docs for running tensorflow examples are not pinning the dependencies.

@Neelabh94
Copy link
Contributor Author

Can you point to the RCA on how pinning the dependency version succeeds the test. The test should not be dependent on a specific version of transformer as its not the core part of the blueprint to test. Moreover cloud docs for running tensorflow examples are not pinning the dependencies.

The test uses these python libraries in a job it executes. The job (a python code) breaks due to change in the internal behavior of these libraries. Pinning them ensure the code behaves in a consistent manner.
I am not fully aware about the update strategy for tensorflow cloud docs but I do believe any breaking change must warrant a change to their doc(s) / example(s) too.

Copy link
Collaborator

@bytetwin bytetwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failures in test should not be propagated as changes to blueprints. The test needs to be fixed to accomodate to this breaking change. Customers when using these models for their workloads will not be using the same version that is pinned here and will find discrepancies

@Neelabh94
Copy link
Contributor Author

Failures in test should not be propagated as changes to blueprints. The test needs to be fixed to accomodate to this breaking change. Customers when using these models for their workloads will not be using the same version that is pinned here and will find discrepancies

As per our discussion offline, will work on replacing the tensorflow jobs with FIO jobs.

@Neelabh94 Neelabh94 dismissed bytetwin’s stale review October 8, 2025 08:25

Discussed offline

@Neelabh94 Neelabh94 added do-not-merge Block merging of this PR and removed test-enhancement Tests enhancement or coverage improvement labels Oct 8, 2025
@Neelabh94 Neelabh94 marked this pull request as draft October 8, 2025 08:27
@Neelabh94 Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 3 times, most recently from 5983077 to 60e5181 Compare October 16, 2025 07:31
@Neelabh94 Neelabh94 changed the title fix(gke-hyperdisk-test): Pin the dependency versions fix(gke-hyperdisk-test): Replace the tensor flow jobs with FIO jobs Oct 16, 2025
@Neelabh94 Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 3 times, most recently from 61351ab to ed694da Compare October 16, 2025 15:50
@Neelabh94 Neelabh94 removed the do-not-merge Block merging of this PR label Oct 16, 2025
@Neelabh94 Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 2 times, most recently from 413f662 to 0daf86c Compare October 16, 2025 16:34
@Neelabh94 Neelabh94 marked this pull request as ready for review October 16, 2025 17:06
@Neelabh94 Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 2 times, most recently from ba47760 to e40b125 Compare October 17, 2025 02:47
@Neelabh94 Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 2 times, most recently from 5c579e7 to e892a94 Compare October 23, 2025 02:43
@Neelabh94 Neelabh94 marked this pull request as draft October 24, 2025 11:10
…ample blueprint

Replaces the tensorflow jobs with FIO jobs in the `gke-managed-hyperdisk` example and
accordingly updates its correspoding test playbook.
@Neelabh94 Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch from 0ee1ef2 to 2f7f378 Compare October 24, 2025 12:12
@Neelabh94 Neelabh94 marked this pull request as ready for review October 24, 2025 12:55
@Neelabh94
Copy link
Contributor Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a great improvement, replacing the flaky TensorFlow tests with more stable and focused FIO benchmarks. The implementation looks solid. I have a few suggestions to improve security, reproducibility, and robustness of the test scripts.

In gke-managed-hyperdisk.yaml, I've pointed out a security concern with running the container as root and suggested pinning the ubuntu image version for reproducibility. I also recommended adding a trap to the bash script to ensure cleanup happens even if a benchmark fails.

In test-gke-managed-hyperdisk.yml, I've suggested a more robust way to wait for the job completion and a cleanup of the logging tasks to avoid redundancy.

@Neelabh94 Neelabh94 merged commit 3405b56 into GoogleCloudPlatform:develop Oct 30, 2025
13 of 66 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-chore To not include into release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants